Lab 5: Intro to Machine Learning

Practice session

Luisa M. Mimmi —   https://luisamimmi.org/

December 11, 2024

GOAL OF TODAY’S PRACTICE SESSION

  • In this Lab session, we will focus on Machine Learning (ML), as introduced in Lecture 5
  • We will review examples of both supervised and unsupervised ML algorithms
    • Supervised ML Example
    • Logistic regression
    • 🌳 Random Forest / decision trees 🌲
    • Unsupervised ML Example
      • K-means Clustering
      • PCA for dimension reduction
    • (optional) PLS-DA for classification, a supervised ML alternative to PCA

🟠 ACKNOWLEDGEMENTS

The examples and datasets in this Lab session follow very closely two sources:

  1. The tutorial on “Principal Component Analysis (PCA) in R” by: Statistics Globe

R ENVIRONMENT SET UP & DATA

Needed R Packages

  • We will use functions from packages base, utils, and stats (pre-installed and pre-loaded)
  • We may also use the packages below (specifying package::function for clarity).
# Load pckgs for this R session

# --- General 
library(here)     # tools to find your project's files, based on working directory
library(dplyr)    # A Grammar of Data Manipulation
library(skimr)    # Compact and Flexible Summaries of Data
library(magrittr) # A Forward-Pipe Operator for R 
library(readr)    # Read Rectangular Text Data

# Plotting & data visualization
library(ggplot2)      # Create Elegant Data Visualisations Using the Grammar of Graphics
library(ggfortify)     # Data Visualization Tools for Statistical Analysis Results
library(scatterplot3d) # 3D Scatter Plot

# --- Statistics
library(MASS)       # Support Functions and Datasets for Venables and Ripley's MASS
library(factoextra) # Extract and Visualize the Results of Multivariate Data Analyses
library(FactoMineR) # Multivariate Exploratory Data Analysis and Data Mining
library(rstatix)    # Pipe-Friendly Framework for Basic Statistical Tests

# --- Tidymodels (meta package)
library(rsample)    # General Resampling Infrastructure  
library(broom)      # Convert Statistical Objects into Tidy Tibbles

DATASETS for today


In this tutorial, we will use:

Dataset on Breast Cancer Biopsy

Name: Biopsy Data on Breast Cancer Patients
Documentation: See reference on the data downloaded and conditioned for R here https://cran.r-project.org/web/packages/MASS/MASS.pdf
Sampling details: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. The dataset contains the original Wisconsin breast cancer data with 699 observations on 11 variables.

Importing Dataset biopsy

  • The data can be interactively obtained from the MASS R package
# (after loading pckg)
# library(MASS)  

# I can call 
utils::data(biopsy)

biopsy variables with description

Variable Type Description
ID character Sample ID
V1 integer 1 - 10 clump thickness
V2 integer 1 - 10 uniformity of cell size
V3 integer 1 - 10 uniformity of cell shape
V4 integer 1 - 10 marginal adhesion
V5 integer 1 - 10 single epithelial cell size
V6 integer 1 - 10 bare nuclei (16 values are missing)
V7 integer 1 - 10 bland chromatin
V8 integer 1 - 10 normal nucleoli
V9 integer 1 - 10 mitoses
class factor benign or malignant

biopsy variables exploration 1/2

The biopsy data contains 699 observations of 11 variables.

The dataset also contains a character variable: ID, and a factor variable: class, with two levels (“benign” and “malignant”).

# check variable types
str(biopsy)
'data.frame':   699 obs. of  11 variables:
 $ ID   : chr  "1000025" "1002945" "1015425" "1016277" ...
 $ V1   : int  5 5 3 6 4 8 1 2 2 4 ...
 $ V2   : int  1 4 1 8 1 10 1 1 1 2 ...
 $ V3   : int  1 4 1 8 1 10 1 2 1 1 ...
 $ V4   : int  1 5 1 1 3 8 1 1 1 1 ...
 $ V5   : int  2 7 2 3 2 7 2 2 2 2 ...
 $ V6   : int  1 10 2 4 1 10 10 1 1 1 ...
 $ V7   : int  3 3 3 3 3 9 3 3 1 2 ...
 $ V8   : int  1 2 1 7 1 7 1 1 1 1 ...
 $ V9   : int  1 1 1 1 1 1 1 1 5 1 ...
 $ class: Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

biopsy variables exploration 2/2

There is also one incomplete variable, V6.

  • remember the package skimr for exploring a dataframe?
# check if vars have missing values
biopsy %>% 
  # select only variables starting with "V"
  skimr::skim(starts_with("V")) %>%
  dplyr::select(skim_variable, 
                n_missing)
# A tibble: 9 × 2
  skim_variable n_missing
  <chr>             <int>
1 V1                    0
2 V2                    0
3 V3                    0
4 V4                    0
5 V5                    0
6 V6                   16
7 V7                    0
8 V8                    0
9 V9                    0

biopsy dataset manipulation

We will:

  • exclude the non-numerical variables (ID and class) before conducting the PCA.

  • exclude the individuals with missing values using the na.omit() or filter(complete.cases(.)) functions.

  • We can do both in 2 equivalent ways:


with base R (more compact)

# new (manipulated) dataset 
data_biopsy <- na.omit(biopsy[,-c(1,11)])

with dplyr (more explicit)

# new (manipulated) dataset 
data_biopsy <- biopsy %>% 
  # drop the non-numeric columns
  dplyr::select(-ID, -class) %>% 
  # drop incomplete observations (rows)
  dplyr::filter(complete.cases(.))

biopsy dataset manipulation

We obtained a new dataset with 9 variables and 683 observations (instead of the original 699).

# check reduced dataset 
str(data_biopsy)
'data.frame':   683 obs. of  9 variables:
 $ V1: int  5 5 3 6 4 8 1 2 2 4 ...
 $ V2: int  1 4 1 8 1 10 1 1 1 2 ...
 $ V3: int  1 4 1 8 1 10 1 2 1 1 ...
 $ V4: int  1 5 1 1 3 8 1 1 1 1 ...
 $ V5: int  2 7 2 3 2 7 2 2 2 2 ...
 $ V6: int  1 10 2 4 1 10 10 1 1 1 ...
 $ V7: int  3 3 3 3 3 9 3 3 1 2 ...
 $ V8: int  1 2 1 7 1 7 1 1 1 1 ...
 $ V9: int  1 1 1 1 1 1 1 1 5 1 ...

🟠 LOGISTIC REGRESSION: EXAMPLE of SUPERVISED ML ALGORITHM

References for this example:

  • https://github.com/sws8/biopsy-analysis/blob/main/biopsy_analysis.pdf
  • https://www.linkedin.com/pulse/logistic-regression-dataset-biopsy-giancarlo-ronci-twpke/
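Since this section points to references but shows no code, here is a minimal, self-contained sketch of how a logistic regression could be fit to the biopsy data with base R's glm(); the 0.5 probability cut-off and the in-sample accuracy check are illustrative choices, not part of the original Lab:

```r
library(MASS)  # provides the biopsy dataset

# drop the ID column and the rows with missing values
dat <- na.omit(biopsy[, -1])

# logistic regression of class (benign/malignant) on the 9 predictors
fit <- glm(class ~ ., data = dat, family = binomial)

# predicted probability of "malignant" (the second factor level)
p_malignant <- predict(fit, type = "response")

# classify with an (illustrative) 0.5 cut-off
pred <- ifelse(p_malignant > 0.5, "malignant", "benign")

# in-sample accuracy (optimistic; a train/test split would be fairer)
accuracy <- mean(pred == dat$class)
```

In practice you would split the data first (e.g. with rsample, loaded above) and assess accuracy on held-out observations.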

🟠 K-MEANS CLUSTERING: EXAMPLE of UNSUPERVISED ML ALGORITHM
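As a placeholder for this section, a minimal self-contained sketch of k-means on the same (scaled) biopsy predictors; the choice of k = 2 clusters, the seed, and nstart = 25 are illustrative assumptions:

```r
library(MASS)  # provides the biopsy dataset

# keep the 9 numeric predictors, drop incomplete rows, and standardize
dat <- na.omit(biopsy[, -1])
X <- scale(dat[, 1:9])

# k-means with k = 2 (we happen to know there are 2 classes);
# nstart = 25 restarts to avoid poor local optima
set.seed(123)
km <- stats::kmeans(X, centers = 2, nstart = 25)

# k-means is unsupervised: compare clusters to the known labels only after the fact
table(cluster = km$cluster, class = dat$class)
```

Cluster numbers are arbitrary, so the correspondence with benign/malignant must be read off the table.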

PCA: EXAMPLE of UNSUPERVISED ML ALGORITHM

Reducing high-dimensional data to a lower number of variables

Calculate Principal Components

The first step of PCA is to calculate the principal components. To accomplish this, we use the prcomp() function from the stats package.

  • With the argument scale = TRUE, each variable in the biopsy data is scaled to have a mean of 0 and a standard deviation of 1 before calculating the principal components (just like the Autoscaling option in MetaboAnalyst)
# calculate principal component
biopsy_pca <- prcomp(data_biopsy, 
                     # standardize variables
                     scale = TRUE)
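To see what scale = TRUE does, the call above can be reproduced (in a self-contained sketch) by standardizing the columns manually first; both fits should yield the same component standard deviations:

```r
library(MASS)

# rebuild the analysis dataset as above
data_biopsy <- na.omit(biopsy[, -c(1, 11)])

# (a) let prcomp() standardize internally
pca_internal <- prcomp(data_biopsy, scale = TRUE)

# (b) standardize first (mean 0, sd 1 per column), then run prcomp()
pca_manual <- prcomp(scale(data_biopsy))

# the components' standard deviations coincide
all.equal(pca_internal$sdev, pca_manual$sdev)
```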

Analyze Principal Components

Let’s check out the elements of our obtained biopsy_pca object

  • (All accessible via the $ operator)
names(biopsy_pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       

“sdev” = the standard deviation of the principal components

“sdev”^2 = the variance of the principal components (eigenvalues of the covariance/correlation matrix)

“rotation” = the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors).

“center” and “scale” = the means and standard deviations of the original variables before the transformation;

“x” = the principal component scores (after PCA the observations are expressed in principal component scores)
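The claim that sdev^2 gives the eigenvalues of the correlation matrix (since we scaled the variables) can be checked directly; a self-contained verification sketch:

```r
library(MASS)

data_biopsy <- na.omit(biopsy[, -c(1, 11)])
biopsy_pca <- prcomp(data_biopsy, scale = TRUE)

# with scale = TRUE, sdev^2 are the eigenvalues of the correlation matrix
eigenvalues <- eigen(cor(data_biopsy))$values
all.equal(biopsy_pca$sdev^2, eigenvalues)
```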

Analyze Principal Components (cont.)

We can see the summary of the analysis using the summary() function

  1. The first row gives the Standard deviation of each component, which can also be retrieved via biopsy_pca$sdev.
  2. The second row shows the Proportion of Variance, i.e. the percentage of explained variance.
summary(biopsy_pca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
                           PC8     PC9
Standard deviation     0.51062 0.29729
Proportion of Variance 0.02897 0.00982
Cumulative Proportion  0.99018 1.00000

Proportion of Variance for components

  1. The row with Proportion of Variance can either be accessed from the summary or calculated as follows:
# a) Extracting Proportion of Variance from summary
summary(biopsy_pca)$importance[2,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982 
# b) (same thing)
round(biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2), digits = 5)
[1] 0.65550 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271 0.02897 0.00982


The output suggests the 1st principal component explains around 65% of the total variance, the 2nd principal component explains about 9% of the variance, and this goes on with diminishing proportion for each component.

Cumulative Proportion of variance for components

  1. The last row of summary(biopsy_pca) shows the Cumulative Proportion of variance, i.e. the cumulative sum of the Proportion of Variance.
# Extracting Cumulative Proportion from summary
summary(biopsy_pca)$importance[3,]
    PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9 
0.65550 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121 0.99018 1.00000 


Once you have computed the PCA in R, you must decide how many components to retain based on the results.
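One simple, if mechanical, rule is to keep the smallest number of components whose cumulative proportion of variance crosses a chosen threshold; the 80% threshold below is an illustrative choice:

```r
library(MASS)

data_biopsy <- na.omit(biopsy[, -c(1, 11)])
biopsy_pca <- prcomp(data_biopsy, scale = TRUE)

# proportion of variance per component
prop_var <- biopsy_pca$sdev^2 / sum(biopsy_pca$sdev^2)

# smallest number of components explaining at least 80% of the variance
n_comp <- which(cumsum(prop_var) >= 0.80)[1]
# for this data n_comp is 3 (cumulative proportion 0.80163 at PC3)
```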

VISUALIZING PCA OUTPUTS

Scree plot

There are several ways to decide on the number of components to retain.

  • One helpful option is visualizing the percentage of explained variance per principal component via a scree plot.
    • Plotting with the fviz_eig() function from the factoextra package
# Scree plot shows the variance of each principal component 
factoextra::fviz_eig(biopsy_pca, 
                     addlabels = TRUE, 
                     ylim = c(0, 70))


Visualization is essential in the interpretation of PCA results. Based on the number of retained principal components, which is usually the first few, the observations expressed in component scores can be plotted in several ways.

Scree plot

The obtained scree plot simply visualizes the output of summary(biopsy_pca).

Principal Component Scores

After a PCA, the observations are expressed as principal component scores.

  1. We can retrieve the principal component scores for each observation by calling biopsy_pca$x, and store them in a new dataframe PC_scores.
  2. Next we draw a scatterplot of the observations – expressed in terms of principal components
# Create new object with PC_scores
PC_scores <- as.data.frame(biopsy_pca$x)
head(PC_scores)
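The scores in biopsy_pca$x are nothing more than the standardized data projected onto the loadings; a self-contained sketch verifying this:

```r
library(MASS)

data_biopsy <- na.omit(biopsy[, -c(1, 11)])
biopsy_pca <- prcomp(data_biopsy, scale = TRUE)

# project the standardized data onto the loading matrix
scores_manual <- scale(data_biopsy) %*% biopsy_pca$rotation

# identical (up to names/attributes) to the stored scores
all.equal(unname(scores_manual), unname(biopsy_pca$x))
```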

It is also important to visualize the observations along the new axes (principal components) to interpret the relations in the dataset:

Principal Component Scores

        PC1         PC2         PC3         PC4         PC5         PC6
1  1.469095 -0.10419679  0.56527102  0.03193593 -0.15088743 -0.05997679
2 -1.440990 -0.56972390 -0.23642767  0.47779958  1.64188188  0.48268150
3  1.591311 -0.07606412 -0.04882192  0.09232038 -0.05969539  0.27916615
4 -1.478728 -0.52806481  0.60260642 -1.40979365 -0.56032669 -0.06298211
5  1.343877 -0.09065261 -0.02997533  0.33803588 -0.10874960 -0.43105416
6 -5.010654 -1.53379305 -0.46067165 -0.29517264  0.39155544 -0.11527442
         PC7        PC8          PC9
1 -0.3491471 -0.4200360 -0.005687222
2  1.1150819 -0.3792992  0.023409926
3 -0.2325697 -0.2096465  0.013361828
4  0.2109599  1.6059184  0.182642900
5 -0.2596714 -0.4463277 -0.038791241
6 -0.3842529  0.1489917 -0.042953075

Principal Component Scores plot (adding label variable)

  1. When data includes a factor variable, like in our case, it may be interesting to show the grouping on the plot as well.
  • In such cases, the label variable class can be added to the PC_scores dataframe as follows.
# retrieve class variable
biopsy_no_na <- na.omit(biopsy)
# adding class grouping variable to PC_scores
PC_scores$Label <- biopsy_no_na$class


The visualization of the observation points (point cloud) could be in 2D or 3D.

Principal Component Scores plot (2D)

The Scores Plot can be visualized via the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud.
ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  ggtitle("Figure 1: Scores Plot") +
  theme_bw()

Principal Component Scores plot (2D)

Figure 1 shows the observations projected into the new data space made up of principal components

Principal Component Scores (2D Ellipse Plot)

Confidence ellipses can also be added to a grouped scatter plot visualized after a PCA. We use the ggplot2 package.

  • grouping is indicated by the argument color = Label;
  • geom_point() is used for the point cloud;
  • the stat_ellipse() function is called to add the ellipses per biopsy group.
ggplot(PC_scores, 
       aes(x = PC1, 
           y = PC2, 
           color = Label)) +
  geom_point() +
  scale_color_manual(values=c("#245048", "#CC0066")) +
  stat_ellipse() + 
  ggtitle("Figure 2: Ellipse Plot") +
  theme_bw()

Principal Component Scores (2D Ellipse Plot)

Figure 2 shows the observations projected into the new data space made up of principal components, with 95% confidence regions displayed.

Principal Component Scores plot (3D)

A 3D scatterplot of observations shows the first 3 principal components’ scores.

  • For this one, we need the scatterplot3d() function of the scatterplot3d package;
  • The color argument is assigned to the Label variable;
  • To add a legend, we use the legend() function and specify its coordinates via the xyz.convert() function.
# 3D scatterplot ...
plot_3d <- with(PC_scores, 
                scatterplot3d::scatterplot3d(PC1, 
                                             PC2, 
                                             PC3, 
                                             color = as.numeric(Label), 
                                             pch = 19, 
                                             main = "Figure 3: 3D Scatter Plot", 
                                             xlab = "PC1",
                                             ylab = "PC2",
                                             zlab = "PC3"))

# ... + legend
legend(plot_3d$xyz.convert(0.5, 0.7, 0.5), 
       pch = 19, 
       yjust=-0.6,
       xjust=-0.9,
       legend = levels(PC_scores$Label), 
       col = seq_along(levels(PC_scores$Label)))

Principal Component Scores plot (3D)

Figure 3 shows the observations projected into the new 3D data space made up of principal components.

Biplot: principal components v. original variables

Next, we create another special type of scatterplot (a biplot) to understand the relationship between the principal components and the original variables.
In the biplot each of the observations is projected onto a scatterplot that uses the first and second principal components as the axes.

  • For this plot, we use the fviz_pca_biplot() function from the factoextra package
    • We will specify the color for the variables, or rather, for the “loading vectors”
    • The habillage argument allows us to highlight the grouping by class with color
factoextra::fviz_pca_biplot(biopsy_pca, 
                repel = TRUE,
                col.var = "black",
                habillage = biopsy_no_na$class,
                title = "Figure 4: Biplot", geom="point")

Biplot: principal components v. original variables

The axes show the principal component scores, and the vectors are the loading vectors

Interpreting biplot output

Biplots have two key elements: scores (the 2 axes) and loadings (the vectors). As in the scores plot, each point represents an observation projected in the space of principal components where:

  • Biopsies of the same class are located closer to each other, which indicates that they have similar scores on the first two principal components;
  • The loading vectors show the strength and direction of association of the original variables with the new PC variables.

As expected from PCA, PC1 alone accounts for variance in almost all of the original variables, while V9 has the largest projection along PC2.
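The statement about V9 can be confirmed numerically from the loading matrix; a small self-contained check:

```r
library(MASS)

data_biopsy <- na.omit(biopsy[, -c(1, 11)])
biopsy_pca <- prcomp(data_biopsy, scale = TRUE)

# variable with the largest absolute loading on PC2
pc2_top <- names(which.max(abs(biopsy_pca$rotation[, 2])))
pc2_top   # "V9"
```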

Interpreting biplot output (cont.)

scores <- biopsy_pca$x

loadings <- biopsy_pca$rotation
# excerpt of first 2 components
loadings[ ,1:2] 
          PC1         PC2
V1 -0.3020626 -0.14080053
V2 -0.3807930 -0.04664031
V3 -0.3775825 -0.08242247
V4 -0.3327236 -0.05209438
V5 -0.3362340  0.16440439
V6 -0.3350675 -0.26126062
V7 -0.3457474 -0.22807676
V8 -0.3355914  0.03396582
V9 -0.2302064  0.90555729

Recap of the workshop’s content

TOPICS WE COVERED

  1. Motivated the choice of learning/using R for scientific quantitative analysis, and laid out some fundamental concepts in biostatistics with concrete R coding examples.

  2. Consolidated the understanding of inferential statistics, through R coding examples conducted on real biostatistics research data.

  3. Discussed the relationship between any two variables, and introduced a widely used analytical tool: regression.

  4. Presented a popular ML technique for dimensionality reduction (PCA), performed both with MetaboAnalyst and R.

  5. Introduced power analysis to define the correct sample size for hypothesis testing, and discussed how ML approaches deal with available data.

Final thoughts

  • While the workshop only allowed for a concise overview of fundamental ideas, it hopefully provided a solid foundation on the most common statistical analyses you will likely run in your daily work:

    • Thorough understanding of the input data and the data collection process
    • Univariate and bivariate exploratory analysis (accompanied by visual intuition) to form hypotheses
    • Fitting the data to the hypothesized model(s), upon verifying the assumptions
    • Assessment of the model performance (\(R^2\), \(Adj. R^2\), \(F\)-statistic, etc.)
  • You should now have a solid grasp of the R language to keep using and exploring the huge potential of this programming ecosystem

  • We only scratched the surface in terms of ML classification and prediction models, but we got the hang of the fundamental steps and some useful tools that might serve us in more advanced analyses as well